Predicting Screening Coagulation Studies Using Coagulation Factor Results

By Amrom Obstfeld

Overview

The interpretation of coagulation studies in clinical practice can be hampered by lack of understanding of the relationship between screening coagulation studies and followup coagulation factor results that are ordered in order to explain abnormalities in these studies. In order to elucidate these relationships I will acquire deideintified laboratory results from CHOP and HUP data warehouses and use statistical and machine learning tools. My ultimate goal is to use these tools in clinical practice to guide clinicians towards appropriate coagulation tests.

Introduction

Screening coagulation studies such as the partial thromboplastin time (PTT) and the prothrombin time (PT) are used to assess the risk of bleeding patients. When abnormalities are seen in one or both of these, additional testing is performed to identify the specific blood coagulation factors that are abnormal. The univariate relationships between the screening studies and the specific factors have been elucidated using empiric laboratory evidence. However these relationships have not been validated using real patient data. Furthermore the multivariate relationships are more difficult to assess in the laboratory. Using accurate real-world models will allow hematologists and coagulation laboratories to better assess the bleeding risk of patients and suggest additional studies when screening results are inconsistent with the results of the coagulation factor levels.

Addressing this problem requires a multidisciplinary approach. Specifically, in depth knowledge of the pathophysiology of several hematological disorders is necessary in order to steer clear of patients with data that would obscure this relationship. For instance, patients with elevated PTT levels as a results of anti-phospholipid antibodies would confound results due to the kinetics of the antibodies present in this situation. Input from pathology and laboratory medicine brings an in depth knowledge of the way the data was generated and allows the project to avoid additional potential confounding variables. For instance, the precise relationship between screening studies and coagulation factor levels may be impacted by changes in reagent manufacturers, changes in lots, or changes in instrumentation. Finally, data scientists provide insight into the process of exploring data and using knowledge of the data, such as its size, quality, and nature, to inform the classifier development approach.

Methods

Two independent datasets containing results from coagulation testing were obtained. The first of these data sets was obtained from a clinical laboratory data repository maintained by the Department of Pathology at Penn Medicine. The second of these datasets was exported out of the CHOP Data warehouse.

library(readr)
# penn <- read_csv(
#   "~/Rprojects/EPID600_Final_Project/coag_res.csv",
#   col_types = cols(
#     drawn_date = col_datetime(format = "%m/%d/%Y %H:%M"),
#     ord_date = col_datetime(format = "%m/%d/%Y %H:%M"),
#     pat_adm_dt = col_datetime(format = "%m/%d/%Y %H:%M"),
#     result = col_character()
#   )
# )

# penn <-
#   read_csv(
#     "~/Rprojects/EPID600_Final_Project/coag_res.v2.csv.gz",
#     col_types = cols(
#       drawn_date = col_datetime(format = "%m/%d/%Y %H:%M"),
#       ord_date = col_datetime(format = "%m/%d/%Y %H:%M"),
#       pat_adm_dt = col_datetime(format = "%m/%d/%Y %H:%M"),
#       result = col_character()
#     )
#   )
#save(penn,file='penndata.Rdata')
#save(penn,file='penndata2.Rdata')
load('penndata.Rdata')

Summary views of the data were reviewed

str(data.frame(penn))
## 'data.frame':    1098194 obs. of  29 variables:
##  $ pat_sex     : chr  "Male" "Male" "Male" "Male" ...
##  $ pat_race    : chr  "White" "White" "White" "White" ...
##  $ pat_adm_dt  : POSIXct, format: NA NA ...
##  $ pat_type    : chr  "(N) No Phlebotomy Client-CCA" "(N) No Phlebotomy Client-CCA" "(N) No Phlebotomy Client-CCA" "(N) No Phlebotomy Client-CCA" ...
##  $ accession   : int  57145 96108 104132 108359 134995 192434 199253 243811 256569 282727 ...
##  $ order_name  : chr  "PT" "PT" "PT" "PT" ...
##  $ task_name   : chr  "PT" "PT" "PT" "PT" ...
##  $ service_res : chr  "HUP Destiny 2" "HUP Destiny 2" "HUP Destiny 2" "HUP Destiny 2" ...
##  $ priority    : chr  "RT - Routine" "RT - Routine" "RT - Routine" "RT - Routine" ...
##  $ ord_date    : POSIXct, format: NA NA ...
##  $ drawn_date  : POSIXct, format: NA NA ...
##  $ result      : chr  "16.4" "14.6" "14.5" "20.5" ...
##  $ result_units: chr  "second(s)" "second(s)" "second(s)" "second(s)" ...
##  $ normal_low  : num  10.7 10.7 10.7 10.7 10.7 11.2 11.2 11.2 11.2 11.2 ...
##  $ normal_high : num  13.8 13.8 13.8 13.8 13.8 13.5 13.5 13.5 13.5 13.5 ...
##  $ fin_class   : chr  NA NA NA NA ...
##  $ empi        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ abn_flag    : chr  "H" "H" "H" "H" ...
##  $ verif_dt_tm : POSIXct, format: "2015-03-02 09:28:38" "2015-04-13 16:05:38" ...
##  $ perf_dt_tm  : POSIXct, format: "2015-03-02 09:28:38" "2015-04-13 16:05:38" ...
##  $ recvd_dt_tm : POSIXct, format: "2015-03-02 09:01:38" "2015-04-13 15:38:38" ...
##  $ updt_dt_tm  : POSIXct, format: "2015-03-03 00:45:04" "2015-04-14 00:53:43" ...
##  $ value       : num  16.4 14.6 14.5 20.5 32.9 35.3 32.1 34.6 28.8 32.3 ...
##  $ value2      : num  16.4 14.6 14.5 20.5 32.9 35.3 32.1 34.6 28.8 32.3 ...
##  $ interp      : chr  "H" "H" "H" "H" ...
##  $ ptile       : num  1 0.999 0.998 1 1 ...
##  $ group       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ deltat      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ N           : int  1 1 1 1 1 1 1 1 1 1 ...
summary(penn)
##    pat_sex            pat_race           pat_adm_dt     
##  Length:1098194     Length:1098194     Min.   :NA       
##  Class :character   Class :character   1st Qu.:NA       
##  Mode  :character   Mode  :character   Median :NA       
##                                        Mean   :NA       
##                                        3rd Qu.:NA       
##                                        Max.   :NA       
##                                        NA's   :1098194  
##    pat_type           accession       order_name         task_name        
##  Length:1098194     Min.   :     1   Length:1098194     Length:1098194    
##  Class :character   1st Qu.:158328   Class :character   Class :character  
##  Mode  :character   Median :318075   Mode  :character   Mode  :character  
##                     Mean   :318884                                        
##                     3rd Qu.:479146                                        
##                     Max.   :640013                                        
##                     NA's   :11                                            
##  service_res          priority            ord_date         drawn_date     
##  Length:1098194     Length:1098194     Min.   :NA        Min.   :NA       
##  Class :character   Class :character   1st Qu.:NA        1st Qu.:NA       
##  Mode  :character   Mode  :character   Median :NA        Median :NA       
##                                        Mean   :NA        Mean   :NA       
##                                        3rd Qu.:NA        3rd Qu.:NA       
##                                        Max.   :NA        Max.   :NA       
##                                        NA's   :1098194   NA's   :1098194  
##     result          result_units         normal_low      normal_high    
##  Length:1098194     Length:1098194     Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.: 11.20   1st Qu.: 13.80  
##  Mode  :character   Mode  :character   Median : 20.80   Median : 34.40  
##                                        Mean   : 21.97   Mean   : 37.45  
##                                        3rd Qu.: 20.80   3rd Qu.: 34.40  
##                                        Max.   :210.00   Max.   :453.00  
##                                        NA's   :4179     NA's   :4179    
##   fin_class              empi          abn_flag        
##  Length:1098194     Min.   :     1   Length:1098194    
##  Class :character   1st Qu.: 28744   Class :character  
##  Mode  :character   Median : 57482   Mode  :character  
##                     Mean   : 56835                     
##                     3rd Qu.: 84631                     
##                     Max.   :113888                     
##                     NA's   :1                          
##   verif_dt_tm                    perf_dt_tm                 
##  Min.   :2014-12-18 08:04:50   Min.   :2014-12-18 08:04:50  
##  1st Qu.:2015-06-22 02:45:49   1st Qu.:2015-06-22 02:43:51  
##  Median :2015-12-15 04:22:08   Median :2015-12-15 04:14:08  
##  Mean   :2015-12-13 20:08:43   Mean   :2015-12-13 20:03:38  
##  3rd Qu.:2016-06-04 00:23:47   3rd Qu.:2016-06-04 00:23:12  
##  Max.   :2016-12-06 20:06:25   Max.   :2016-12-06 20:06:25  
##                                                             
##   recvd_dt_tm                    updt_dt_tm                 
##  Min.   :2014-11-25 07:11:39   Min.   :2014-12-19 06:42:53  
##  1st Qu.:2015-06-22 01:24:04   1st Qu.:2015-06-23 04:08:06  
##  Median :2015-12-15 03:05:07   Median :2015-12-15 23:49:20  
##  Mean   :2015-12-13 18:48:36   Mean   :2015-12-15 01:14:37  
##  3rd Qu.:2016-06-03 23:20:00   3rd Qu.:2016-06-04 19:36:27  
##  Max.   :2016-12-06 19:52:25   Max.   :2016-12-07 06:32:36  
##                                                             
##      value             value2          interp              ptile      
##  Min.   :   0.00   Min.   :   0.0   Length:1098194     Min.   :0.000  
##  1st Qu.:  14.60   1st Qu.:  14.6   Class :character   1st Qu.:0.667  
##  Median :  26.30   Median :  26.4   Mode  :character   Median :0.982  
##  Mean   :  39.13   Mean   :  39.9                      Mean   :0.804  
##  3rd Qu.:  35.40   3rd Qu.:  35.7                      3rd Qu.:1.000  
##  Max.   :4444.00   Max.   :4444.0                      Max.   :1.000  
##  NA's   :8347      NA's   :1989                        NA's   :4179   
##      group            deltat               N         
##  Min.   :     0   Min.   : 0.00000   Min.   : 1.000  
##  1st Qu.:162827   1st Qu.: 0.00000   1st Qu.: 1.000  
##  Median :321367   Median : 0.00000   Median : 1.000  
##  Mean   :318007   Mean   : 0.01064   Mean   : 1.499  
##  3rd Qu.:474259   3rd Qu.: 0.00000   3rd Qu.: 2.000  
##  Max.   :625306   Max.   :10.91667   Max.   :32.000  
##  NA's   :12
#May want to explore how NA's are correlate, may want to see how the two value fields relate to each other and to the result.

Columns of relevance were selected

library(dplyr)
cols<-c(5,17,21,7,12,14,15)
penn<-penn%>%select(cols)

Task_name is the variable that holds the test name. Need to filter dataset to tests relevant to the PT and PTT screening tests

table(penn$task_name)
## 
##            DTT      Factor II      Factor IX       Factor V     Factor VII 
##             44            193            285            208            218 
##    Factor VIII       Factor X      Factor XI     Factor XII    Factor XIII 
##           1722            204            227             36             48 
##     Fibrinogen             PT   PT 0 Patient  PT 30 Patient  PT 60 Patient 
##          35955         530466             91             91             11 
##        PT TEMP          PT WB            PTT  PTT 0 Patient PTT 30 Patient 
##          40667            129         485547            148            148 
## PTT 60 Patient         PTT WB Reptilase Test            TTI 
##             23             12            495           1226
#filter out extraneous testing pulled out of warehouse
penn<-penn%>%filter(!task_name %in% c('Factor XIII', 'PT 0 Patient',  'PT 30 Patient','PT TEMP','PTT 0 Patient', 'PTT 30 Patient','PTT 60 Patient','TTI','PT WB','PTT WB','DTT','PT 60 Patient'))

All ‘result’ values should be numeric, non-numeric were explored.

#Index by na's introduced by as.numeric coercion
penn%>%select(4,5)%>%filter(is.na(as.numeric(result)))%>%table
##                 result
## task_name        < 45 < 60   <1 <15.0 <150.0  <45  <50  <60 > 1400 > 90
##   Factor II         0    0    0     0      0    0    0    0      0    0
##   Factor IX         0    0    8     0      0    0    0    0      0    0
##   Factor V          0    0    1     0      0    0    0    0      0    0
##   Factor VII        0    0    0     0      0    0    0    0      0    0
##   Factor VIII       0    0   68     0      0    0    0    0      0    0
##   Factor X          0    0    2     0      0    0    0    0      0    0
##   Factor XI         0    0    6     0      0    0    0    0      0    0
##   Fibrinogen        4    1    0     0      0   67    7    1      9    0
##   PT                0    0    0     0      0    0    0    0      0   76
##   PTT               0    0    0     2      1    0    0    0      0    0
##   Reptilase Test    0    0    0     0      0    0    0    0      0    0
##                 result
## task_name          >1 >100.0 >1000 >140.0 >1400 >150 >150.0  >90 >90.0
##   Factor II         0      0     0      0     0    0      0    0     0
##   Factor IX         0      0     0      0     0    0      0    0     0
##   Factor V          0      0     0      0     0    0      0    0     0
##   Factor VII        0      0     0      0     0    0      0    0     0
##   Factor VIII       1      0     1      0     0    0      0    0     0
##   Factor X          0      0     0      0     0    0      0    0     0
##   Factor XI         0      0     0      0     0    0      0    0     0
##   Fibrinogen        0      0     1      0    86    0      0    0     0
##   PT                0      0     0      0     0    0      0   55   214
##   PTT               0      0     0      3     0  600   5131    0     0
##   Reptilase Test    0      2     0      0     0    0      0    0     0
##                 result
## task_name        >900 clot Clott clotted Clotted clotted Sample
##   Factor II         0    0     0       0       0              0
##   Factor IX         0    0     0       0       0              0
##   Factor V          0    0     0       0       0              0
##   Factor VII        0    0     0       0       0              0
##   Factor VIII       0    0     0       0       0              0
##   Factor X          0    0     0       0       0              0
##   Factor XI         0    0     0       0       0              0
##   Fibrinogen        7    0     0       1       0              0
##   PT                0    2     1      12       5              0
##   PTT               0    2     0      14       4              1
##   Reptilase Test    0    0     0       0       0              0
##                 result
## task_name        Delayed sample diregard disregard Disregard hemolysed
##   Factor II                   0        0         0         0         0
##   Factor IX                   0        0         0         0         0
##   Factor V                    0        0         0         0         0
##   Factor VII                  0        0         0         0         0
##   Factor VIII                 0        0         0         0         0
##   Factor X                    0        0         0         0         0
##   Factor XI                   0        0         0         0         0
##   Fibrinogen                  0        0         0        13         0
##   PT                          4        0         8       282         2
##   PTT                         0        1         3       285         2
##   Reptilase Test              0        0         0         0         0
##                 result
## task_name        Hemolysed hemolyzed Hemolyzed invalid Invalid sample
##   Factor II              0         0         0       0              0
##   Factor IX              0         0         0       0              0
##   Factor V               0         0         0       0              0
##   Factor VII             0         0         0       0              0
##   Factor VIII            0         0         0       0              0
##   Factor X               0         0         0       0              0
##   Factor XI              0         0         0       0              0
##   Fibrinogen             0         0         0       0              0
##   PT                     0         2         1       1              1
##   PTT                    1         0         1       1              1
##   Reptilase Test         0         0         0       0              0
##                 result
## task_name        MISLABEL mislabeled Mislabeled   na NO RESULT not done
##   Factor II             0          0          0    0         0        0
##   Factor IX             0          0          0    0         0        0
##   Factor V              0          0          0    0         0        0
##   Factor VII            0          0          0    0         0        0
##   Factor VIII           0          0          0    0         0        0
##   Factor X              0          0          0    0         0        0
##   Factor XI             0          0          0    0         0        0
##   Fibrinogen            1          0          0    0         1        0
##   PT                    2          1          1    1         3        1
##   PTT                   2          1          1    2         1        0
##   Reptilase Test        0          0          0    0         0        0
##                 result
## task_name         qns  QNS Sample Clotted Sample mislabel see comment
##   Factor II         0    0              0               0           0
##   Factor IX         0    0              0               0           1
##   Factor V          0    0              0               0           0
##   Factor VII        0    0              0               0           0
##   Factor VIII       0    0              0               0           1
##   Factor X          0    0              0               0           0
##   Factor XI         0    0              0               0           1
##   Fibrinogen        6    0              0               0           0
##   PT               17    4              0               2           1
##   PTT              11    3              1               1           1
##   Reptilase Test    0    0              0               0           0
##                 result
## task_name        SEE MEDVIEW see note See note See Note SEE NOTE
##   Factor II                1        0        0        0        0
##   Factor IX                1        0        0        0        0
##   Factor V                 0        0        0        0        0
##   Factor VII               1        0        0        0        0
##   Factor VIII              0        0        0        0        0
##   Factor X                 1        0        0        0        0
##   Factor XI                0        0        0        0        0
##   Fibrinogen               0        2        0       11        0
##   PT                       0       22        1      137        0
##   PTT                      0       14        1       77        1
##   Reptilase Test           0        0        0        0        0
##                 result
## task_name        Spec Hemolyzed   xx   XX  xxx  XXX xxxx
##   Factor II                   0    0    0    1    1    0
##   Factor IX                   0    0    0    0    1    0
##   Factor V                    0    0    0    1    2    0
##   Factor VII                  0    0    0    1    1    0
##   Factor VIII                 0    0    0    1    1    0
##   Factor X                    0    0    0    1    1    0
##   Factor XI                   0    0    0    0    1    0
##   Fibrinogen                  0    0    0   34    4    0
##   PT                          1    1    4  340   34    1
##   PTT                         1    3    3  309   27    3
##   Reptilase Test              0    0    0    0    0    0

All textbased comments in the result field indicate fundamental issues with the result and should be filtered out. Many of the remainder are due to a measurement being out of range high or low. most of these can be kept and replaced with the numeric value indicated in the field for the purposes of this analysis

#Explore weird results
#Replace with NA those results without numbers
penn$result[grepl("[:alpha:]",penn$result)]<-NA
penn$result[grepl("[:a-z:]",penn$result)]<-NA
penn$result[grepl("[:A-Z:]",penn$result)]<-NA
#Results that are 'greater than' a number (such as 150, 90, etc) should be replaced with that number. 
penn$result<-gsub(">", "", penn$result) 
#Less than with a small number should be replaced by 0. 
penn$result<-gsub("<", "", penn$result)
#Text results should be filtered out of dataset.
penn<-penn%>%filter(!is.na(result))
#Convert result variable to numeric
penn$result<-as.numeric(penn$result)
#confirm removal of all non-numeric results
penn%>%filter(is.na(result))%>%count%>%as.numeric
## [1] 0

A histogram was created in order to see the amount of each kind of result in the dataset. The plot demonstrates that the vast majority of tests included only screening tests and not individual factors. These will have to be filtered from the data set (see below).

library(knitr)
library(ggplot2)
ggplot(penn)+
  geom_bar(aes(x=task_name))+
  scale_y_log10()

knitr::kable(penn%>%group_by(task_name)%>%summarise(n=n())%>%arrange(desc(n)),col.names = c('Test','Number'))
Test Number
PT 529569
PTT 484764
Fibrinogen 35882
Factor VIII 1719
Reptilase Test 495
Factor IX 282
Factor XI 225
Factor VII 215
Factor V 205
Factor X 201
Factor II 190
Factor XII 36

The accession number is essentially the order number. This number groups labs that were performed on the same sample. This is the unit upon which we will group the dependent and independent variables. The ‘empi’ is also important as it represents the patient identifier in this de-identified data set. To facilitate analysis, the data was reshaped to a spread format, such that the independent and dependent variables now present in the ‘task_name’ variables are spread out in their own columns. Results are grouped by accession number. Before doing so, data cleaning was performed.

library(tidyr)
library(lubridate)
#Before spreading data need to ensure that all entries are unique
dblresults<-penn%>%group_by(accession,task_name)%>%summarise(n=n())%>%ungroup()%>%filter(n>1)
table(dblresults$task_name)
## 
##      Factor II     Factor VII    Factor VIII       Factor X     Fibrinogen 
##              1              1              3              2             17 
##             PT            PTT Reptilase Test 
##            196            164              1
dupresults<-penn%>%group_by(accession,task_name,result)%>%summarise(n=n())%>%select(1,2,4)%>%ungroup()%>%filter(n>1)
table(dupresults$task_name)
## 
##   Factor II  Factor VII Factor VIII    Factor X  Fibrinogen          PT 
##           1           1           1           2           4          77 
##         PTT 
##          66

There were 152 entries with duplicated data. There were 233 with two results for the same test in the same accession.

Overall the number of duplications are fairly small. This appears to be due to duplication of a screening test (PT, PTT, or fibrinogen) off of an identical draw time. The percent difference between these was calculated and results filtered for tests discordant by more than a CV of 20%. A 20% CV would be considered a significant difference between lab results in the field of clinical coagulation.

acc<-left_join(dblresults,penn, by=c('accession' = 'accession', 'task_name' = 'task_name'))
acc%>%group_by(accession,task_name)%>%summarise(n=n(),mean=mean(result),stdev=sd(result),cv=sd(result)/mean(result))%>%filter(n>1)%>%filter(cv>0.2)
## Source: local data frame [9 x 6]
## Groups: accession [8]
## 
##   accession      task_name     n   mean       stdev        cv
##       <int>          <chr> <int>  <dbl>       <dbl>     <dbl>
## 1     79354 Reptilase Test     2 860.70 1192.606297 1.3856237
## 2    102747            PTT     2 113.40   23.475945 0.2070189
## 3    237294            PTT     2  78.55   16.192745 0.2061457
## 4    240860    Factor VIII     2  78.00   18.384776 0.2357023
## 5    260773            PTT     2  26.45    6.293250 0.2379301
## 6    290958             PT     2  65.25   35.001786 0.5364258
## 7    391698            PTT     2  36.25    7.283200 0.2009159
## 8        NA             PT     6  18.25    6.445696 0.3531888
## 9        NA            PTT     5  65.90   45.347767 0.6881300

Nine of these differences were greater than 20%. As a result the agreement was considered close enough for the analysis in this project and these values were replaced with the average of these results. The following dataframe accounts for this correction and spreads variables within task_name across new column fields with values coming from

acc.rm<-acc%>%group_by(accession,task_name)%>%summarise(n=n(),mean=mean(result),stdev=sd(result),cv=sd(result)/mean(result))%>%filter(n>1)%>%filter(cv>.2)

penndf<-anti_join(penn,acc.rm,by=c('accession' = 'accession', 'task_name' = 'task_name'))%>%
group_by(empi,accession,task_name)%>%summarise(result=mean(result))

#Now spread tests across column names for use during regression
penndf<-penndf%>%spread(task_name,result)


str(data.frame(penndf))
## 'data.frame':    598217 obs. of  14 variables:
##  $ empi          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ accession     : int  57145 96108 104132 108359 134995 192434 199253 243811 256569 282727 ...
##  $ Factor.II     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.IX     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.V      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.VII    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.VIII   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.X      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.XI     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Factor.XII    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Fibrinogen    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PT            : num  16.4 14.6 14.5 20.5 32.9 35.3 32.1 34.6 28.8 32.3 ...
##  $ PTT           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Reptilase.Test: num  NA NA NA NA NA NA NA NA NA NA ...
summary(penndf)
##       empi          accession        Factor II        Factor IX     
##  Min.   :     1   Min.   :     1   Min.   :  1.0    Min.   :  1.0   
##  1st Qu.: 27050   1st Qu.:158861   1st Qu.: 38.0    1st Qu.: 29.2   
##  Median : 55431   Median :318762   Median : 65.0    Median : 59.0   
##  Mean   : 55410   Mean   :319433   Mean   : 63.9    Mean   : 61.7   
##  3rd Qu.: 82942   3rd Qu.:479981   3rd Qu.: 90.0    3rd Qu.: 84.0   
##  Max.   :113888   Max.   :640013   Max.   :158.0    Max.   :183.0   
##  NA's   :1                         NA's   :598028   NA's   :597935  
##     Factor V        Factor VII      Factor VIII        Factor X     
##  Min.   :  1      Min.   :  1.0    Min.   :   1.0   Min.   :  1.0   
##  1st Qu.: 53      1st Qu.: 34.2    1st Qu.:  71.0   1st Qu.: 37.0   
##  Median : 75      Median : 53.0    Median : 114.0   Median : 62.0   
##  Mean   : 76      Mean   : 58.9    Mean   : 133.8   Mean   : 60.4   
##  3rd Qu.:100      3rd Qu.: 78.0    3rd Qu.: 179.0   3rd Qu.: 81.0   
##  Max.   :239      Max.   :189.0    Max.   :1000.0   Max.   :164.0   
##  NA's   :598012   NA's   :598003   NA's   :596502   NA's   :598018  
##    Factor XI        Factor XII       Fibrinogen           PT       
##  Min.   :  1.0    Min.   :  8.0    Min.   :  45.0   Min.   : 0.5   
##  1st Qu.: 34.0    1st Qu.: 44.0    1st Qu.: 234.0   1st Qu.:13.3   
##  Median : 63.0    Median : 56.5    Median : 356.0   Median :14.5   
##  Mean   : 64.8    Mean   : 66.0    Mean   : 367.9   Mean   :17.2   
##  3rd Qu.: 93.0    3rd Qu.: 79.8    3rd Qu.: 462.0   3rd Qu.:18.2   
##  Max.   :171.0    Max.   :178.0    Max.   :1845.0   Max.   :90.0   
##  NA's   :597992   NA's   :598181   NA's   :562352   NA's   :68850  
##       PTT         Reptilase Test  
##  Min.   : 11.00   Min.   : 14.0   
##  1st Qu.: 28.10   1st Qu.: 16.3   
##  Median : 32.20   Median : 17.4   
##  Mean   : 40.27   Mean   : 18.8   
##  3rd Qu.: 44.20   3rd Qu.: 18.7   
##  Max.   :150.00   Max.   :100.0   
##  NA's   :113625   NA's   :597724

Normal results were pulled from the original data set for future analyses

# Want to insert normal values here as well. Multiple normal values present so will use the ones most frequently in the dataset.
normals<-penn%>%select(4,6,7)%>%group_by(task_name,normal_low,normal_high)%>%summarize(n=n())%>%ungroup()%>%group_by(task_name)%>%filter(n==max(n))%>%select(-n)
normals$task_name<-gsub(" ","",normals$task_name)
normals
## Source: local data frame [12 x 3]
## Groups: task_name [12]
## 
##        task_name normal_low normal_high
##            <chr>      <dbl>       <dbl>
## 1       FactorII       76.0       127.0
## 2       FactorIX       75.0       125.0
## 3        FactorV       65.0       128.0
## 4      FactorVII       66.0       150.0
## 5     FactorVIII       50.0       200.0
## 6        FactorX       73.0       135.0
## 7       FactorXI       60.0       140.0
## 8      FactorXII       50.0       160.0
## 9     Fibrinogen      170.0       410.0
## 10            PT       11.2        13.5
## 11           PTT       20.8        34.4
## 12 ReptilaseTest        0.0        22.0

Most of these orders will likely only have PT or PTT, the screening test and the dependent variables in this analysis, as such these need to be filtered out since they will not contribute to the analysis.

#The following lines labels all rows with NAs in every independent predicting variable as a '10', any number less than 10 means that there is at least one 

penndf[,15]<-as.numeric()
names(penndf)[15]<-"allna"

penndf$allna<-rowSums(apply(is.na(penndf[,c(3:11,14)]), 2, as.numeric))


#Plot distribution of number of results for each acccession
ggplot(penndf)+
  geom_bar(aes(x=allna))+
  scale_y_log10()+
  ggtitle("Distribution of number of missing data per entry")+
  labs(x="Number of factor assays missing",y="Number of entries (log)",title="Distribution of number of missing data per entry")

As a result of the uselessness of having numerous accessions with dependent variablesbut without the predictors or vice versa these have been dropped.

#create final dataset with filtering. Rename variables to remove spaces
penndf<-penndf%>%filter(!(is.na(PT) & is.na(PTT)))

penndf<-penndf%>%filter(allna<10)
#change names because easier to work without space
names(penndf)<-gsub(" ","",names(penndf))
summary(penndf)
##       empi          accession         FactorII        FactorIX     
##  Min.   :    14   Min.   :    14   Min.   :  5.0   Min.   :  1.00  
##  1st Qu.: 41698   1st Qu.:171345   1st Qu.: 38.5   1st Qu.: 29.25  
##  Median : 69822   Median :329978   Median : 65.0   Median : 59.50  
##  Mean   : 64944   Mean   :328226   Mean   : 64.2   Mean   : 61.82  
##  3rd Qu.: 89090   3rd Qu.:486902   3rd Qu.: 90.0   3rd Qu.: 84.00  
##  Max.   :113878   Max.   :639993   Max.   :158.0   Max.   :183.00  
##                                    NA's   :34214   NA's   :34119   
##     FactorV         FactorVII        FactorVIII        FactorX      
##  Min.   :  1.00   Min.   :  2.00   Min.   :   1.0   Min.   :  1.00  
##  1st Qu.: 55.00   1st Qu.: 35.00   1st Qu.:  70.0   1st Qu.: 37.50  
##  Median : 75.00   Median : 53.50   Median : 114.0   Median : 63.00  
##  Mean   : 76.44   Mean   : 59.52   Mean   : 133.5   Mean   : 60.91  
##  3rd Qu.:102.00   3rd Qu.: 78.00   3rd Qu.: 178.0   3rd Qu.: 81.00  
##  Max.   :239.00   Max.   :189.00   Max.   :1000.0   Max.   :164.00  
##  NA's   :34196    NA's   :34187    NA's   :32712    NA's   :34202   
##     FactorXI        FactorXII        Fibrinogen           PT       
##  Min.   :  1.00   Min.   :  8.00   Min.   :  45.0   Min.   : 9.80  
##  1st Qu.: 36.00   1st Qu.: 44.00   1st Qu.: 236.0   1st Qu.:13.00  
##  Median : 63.00   Median : 56.50   Median : 361.0   Median :14.00  
##  Mean   : 65.25   Mean   : 66.03   Mean   : 370.1   Mean   :15.29  
##  3rd Qu.: 93.00   3rd Qu.: 79.75   3rd Qu.: 465.0   3rd Qu.:15.60  
##  Max.   :171.00   Max.   :178.00   Max.   :1845.0   Max.   :90.00  
##  NA's   :34177    NA's   :34361    NA's   :1465     NA's   :341    
##       PTT         ReptilaseTest        allna      
##  Min.   : 16.20   Min.   : 14.00   Min.   :0.000  
##  1st Qu.: 27.40   1st Qu.: 16.27   1st Qu.:9.000  
##  Median : 30.20   Median : 17.40   Median :9.000  
##  Mean   : 34.61   Mean   : 18.78   Mean   :8.941  
##  3rd Qu.: 35.20   3rd Qu.: 18.70   3rd Qu.:9.000  
##  Max.   :150.00   Max.   :100.00   Max.   :9.000  
##  NA's   :872      NA's   :33909

These are the final numbers of tests for the analysis

knitr::kable(penndf%>%gather('task_name','result',3:14)%>%filter(result!='NA')%>%group_by(task_name)%>%summarise(n=n())%>%arrange(desc(n)),col.names = c('Test','Numbber'))
Test Numbber
PT 34056
PTT 33525
Fibrinogen 32932
FactorVIII 1685
ReptilaseTest 488
FactorIX 278
FactorXI 220
FactorVII 210
FactorV 201
FactorX 195
FactorII 183
FactorXII 36

CHOP data

Additional CHOP data was brought in to confirm that the general approach was tenable in other datasets, and as a second set of data to use so that models trained with each set can be tested against the other. The data from CHOP contains many more factor assays, however the results come from over a decade of testing, possibly introducing a confounder.

load('cdwdata.RData')
chop<-cdwdata
names(chop)[2]<-'task_name'
names(chop)[1]<-'col'
names(chop)[10]<-'result'

table(chop$task_name)
## 
##                   FACTOR II                   FACTOR IX 
##                        1558                        3544 
##                    FACTOR V                  FACTOR VII 
##                        1944                        2974 
##                 FACTOR VIII                    FACTOR X 
##                       20867                        1622 
##                   FACTOR XI            FACTOR XII ASSAY 
##                        1400                         808 
##                  FIBRINOGEN PARTIAL THROMBOPLASTIN TIME 
##                       95545                      252449 
##            PROTHROMBIN TIME 
##                      268404
chop[grep("FACTOR XII ASSAY", chop$task_name), 'task_name']<-'FactorXII'
chop[grep("FACTOR VIII", chop$task_name), 'task_name']<-'FactorVIII'
chop[grep("FACTOR VII", chop$task_name), 'task_name']<-'FactorVII'
chop[grep("FACTOR II", chop$task_name), 'task_name']<-'FactorII'
chop[grep("FACTOR V", chop$task_name), 'task_name']<-'FactorV'
chop[grep("FACTOR IX", chop$task_name), 'task_name']<-'FactorIX'
chop[grep("FACTOR XI", chop$task_name), 'task_name']<-'FactorXI'
chop[grep("FACTOR X", chop$task_name), 'task_name']<-'FactorX'

chop[grep("PROTHROMBIN TIME", chop$task_name), 'task_name']<-'PT'
chop[grep("PARTIAL THROMBOPLASTIN TIME", chop$task_name), 'task_name']<-'PTT'
chop[grep("FIBRINOGEN", chop$task_name), 'task_name']<-'Fibrinogen'

Non-numeric results were explored as above.

#Index by na's introduced by as.numeric coercion
chop%>%select(task_name,result)%>%filter(is.na(as.numeric(result)))%>%table
##             result
## task_name          < 1 < 1.0 < 10.0 < 15 < 15.0 < 150.0 < 20.0  < 3 < 35
##   FactorII      1    1     0      0    0      0       0      0    1    0
##   FactorIX    105   73     2      0    0      0       0      0    0    0
##   FactorV      25    0     0      0    0      0       0      0    0    0
##   FactorVII     2    9     0      0    0      0       0      0    0    0
##   FactorVIII   75  455     3      0    0      0       0      0    0    0
##   FactorX       4    1     0      0    0      0       0      0    0    0
##   FactorXI     21   29     3      0    0      0       0      0    0    0
##   FactorXII    19    4     0      0    0      0       0      0    0    0
##   Fibrinogen   12    0     0      0   16      0       0      0    0   50
##   PT           77    0     0      1    0      0       0      0    0    0
##   PTT        1586    0     0      0    0      4       2      1    0    0
##             result
## task_name    < 45 < 49 < 5.0 < 50   <1 <1.0 <15.0 <250.0  <45  <49  <50
##   FactorII      0    0     0    0    1    0     0      0    0    0    0
##   FactorIX      0    0     0    0   26    1     0      0    0    0    0
##   FactorV       0    0     0    0    0    0     0      0    0    0    0
##   FactorVII     0    0     0    0    5    0     0      0    0    0    0
##   FactorVIII    0    0     0    0  233    1     0      0    0    0    0
##   FactorX       0    0     0    0    2    0     0      0    0    0    0
##   FactorXI      0    0     0    0    5    0     0      0    0    0    0
##   FactorXII     0    0     0    0    1    0     0      0    0    0    0
##   Fibrinogen  225    3     0    5    0    0     0      0  138    2    1
##   PT            0    0     1    0    0    0     0      0    0    0    0
##   PTT           0    0     0    0    0    0     2      1    0    0    0
##             result
## task_name    > 100.0 > 120 > 120.0 > 125.0 > 1400 > 150.0 > 200.0 > 240.0
##   FactorII         0     0       0       0      0       0       0       0
##   FactorIX         0     0       0       0      0       0       0       0
##   FactorV          0     0       0       0      0       0       0       0
##   FactorVII        0     0       0       0      0       0       0       0
##   FactorVIII       0     0       0       0      0       0       0       0
##   FactorX          0     0       0       0      0       0       0       0
##   FactorXI         0     0       0       0      0       0       0       0
##   FactorXII        0     0       0       0      0       0       0       0
##   Fibrinogen       0     1       0       0     11       0       0       0
##   PT               1     0       0      93      0       1       0       0
##   PTT              1     0       1       0      0     217       1       3
##             result
## task_name    > 25.0 > 250 > 250.0 > 40.0 > 422 > 50.0 > 75.0 > 900 >120.0
##   FactorII        0     0       0      0     0      0      0     0      0
##   FactorIX        0     0       0      0     0      0      0     0      0
##   FactorV         0     0       0      0     0      0      0     0      0
##   FactorVII       0     0       0      0     0      0      0     0      0
##   FactorVIII      0     0       0      0     0      0      0     0      0
##   FactorX         0     0       0      0     0      0      0     0      0
##   FactorXI        0     0       0      0     0      0      0     0      0
##   FactorXII       0     0       0      0     0      0      0     0      0
##   Fibrinogen      0     0       0      0     2      0      0     7      0
##   PT              0     0       0      3     0     38     33     0      1
##   PTT             1    11    1909      0     0      0      1     0      0
##             result
## task_name    >125.0 >145 >150 >150.0 >250.0 >250.0@L  >45  >50 >844 13.9@L
##   FactorII        0    0    0      0      0        0    0    0    0      0
##   FactorIX        0    0    0      0      0        0    0    0    0      0
##   FactorV         0    0    0      0      0        0    0    0    0      0
##   FactorVII       0    0    0      0      0        0    0    0    0      0
##   FactorVIII      0    1    0      0      0        0    0    0    0      0
##   FactorX         0    0    0      0      0        0    0    0    0      0
##   FactorXI        0    0    0      0      0        0    0    0    0      0
##   FactorXII       0    0    0      0      0        0    0    0    0      0
##   Fibrinogen      0    0    0      0      0        0    1    0    3      0
##   PT             70    0    0      0      0        0    0   97    0      1
##   PTT             2    0 1026      2   1360        1    0    0    0      0
##             result
## task_name    15.0@L 185@L 52@L 70@L CANCELED Insuff Quant   ND NOT AVAIL.
##   FactorII        0     0    0    0        0            2    0          0
##   FactorIX        0     0    0    1        0            1    0          0
##   FactorV         0     0    0    0        0            1    0          0
##   FactorVII       0     0    0    0        0            2    0          0
##   FactorVIII      0     1    0    0        0           14    0          0
##   FactorX         0     0    0    0        0            4    0          0
##   FactorXI        0     0    0    0        0            1    0          0
##   FactorXII       0     0    1    0        0            0    0          0
##   Fibrinogen      0     0    0    0        0          108    0          0
##   PT              1     0    0    0       25          378    0          2
##   PTT             0     0    0    0        0          367    1          0
##             result
## task_name    Not Done NOT DONE PENDING see comment See Comment SEE COMMENT
##   FactorII         61        0       5           0           0           0
##   FactorIX        101        1       6           1           0           0
##   FactorV          92        0       4           1           0           0
##   FactorVII       127        0       5           0           0           0
##   FactorVIII      365        0      10           3           2           1
##   FactorX          98        0       3           0           0           0
##   FactorXI         43        0       3           1           0           0
##   FactorXII        27        0       1           0           0           0
##   Fibrinogen     2481        0       4           0           0           0
##   PT             7887        0     177           0           0           0
##   PTT            7816        0      14           1           0           0
##             result
## task_name    See Comment @L see comment for result See Comment@L
##   FactorII                0                      0             0
##   FactorIX                0                      0             0
##   FactorV                 0                      0             0
##   FactorVII               0                      0             0
##   FactorVIII              0                      1             0
##   FactorX                 0                      0             0
##   FactorXI                0                      0             0
##   FactorXII               0                      0             0
##   Fibrinogen              0                      0             0
##   PT                      0                      0             0
##   PTT                     3                      0             2
##             result
## task_name    Test not performed  TNP
##   FactorII                    0    0
##   FactorIX                    0    0
##   FactorV                     0    0
##   FactorVII                   0    0
##   FactorVIII                  4    0
##   FactorX                     0    0
##   FactorXI                    0    0
##   FactorXII                   0    0
##   Fibrinogen                  5    0
##   PT                          8    6
##   PTT                         9    0

All text-based comments in the result field indicate fundamental issues with the result and should be filtered out. Many of the remainder are due to a measurement being out of range high or low. Most of these can be kept and replaced with the numeric value indicated in the field for the purposes of this analysis

#Explore weird results
#Replace with NA those results without numbers
chop$result<-gsub("@L", "", chop$result) 
chop$result[grepl("[:alpha:]",chop$result)]<-NA
chop$result[grepl("[:a-z:]",chop$result)]<-NA
chop$result[grepl("[:A-Z:]",chop$result)]<-NA
chop$result[chop$result==""]<-NA
#Results that are 'greater than' a number (such as 150, 90, etc) should be replaced with that number. 
chop$result<-gsub(">", "", chop$result) 
#Less than with a small number should be replaced by 0. 
chop$result<-gsub("<", "", chop$result)
#Text results should be filtered out of dataset.
chop<-chop%>%filter(!is.na(result))

#Convert result variable to numeric
chop$result<-as.numeric(chop$result)
#confirm removal of all non-numeric results
chop%>%filter(is.na(result))%>%count%>%as.numeric
## [1] 0

Below is a table of the available results

table(chop$task_name)
## 
##   FactorII   FactorIX    FactorV  FactorVII FactorVIII    FactorX 
##       1489       3329       1821       2838      20392       1513 
##   FactorXI  FactorXII Fibrinogen         PT        PTT 
##       1331        761      92935     259844     242650

The CHOP data did not have a single identifier to indicate a single blood draw upon which testing was performed. To begin to select patients of interest, the dataset was limited to patients who had ever had factor testing.

#Filter out to tests for patients that have had factor testing done.
empi<-chop%>%filter(!task_name %in% c('PTT','PT','Fibrinogen'))%>%select(empi)%>%distinct()
chop.fc<-semi_join(chop,empi,by='empi')

Since no accession number was available to identify uniqe blood draws, samples were grouped based on patient and date of collection. Duplications of tests within days were explored.

#Remove results thatfrom the same collection date/time that are discordant,ones that areconcortant, average together
dblresults<-chop.fc%>%group_by(empi,col,task_name)%>%summarise(n=n(),mean=mean(result),stdev=sd(result),cv=sd(result)/mean(result))%>%filter(n>1)
table(dblresults$task_name)
## 
##   FactorII   FactorIX    FactorV  FactorVII FactorVIII    FactorX 
##          7        145         15         18       1134          7 
##   FactorXI  FactorXII Fibrinogen         PT        PTT 
##          4          3       4951       8252      10934

Results for tests that were drawn on a single day and had CV greater than 20% were removed.

remresults<-dblresults%>%filter(cv>0.2)
chop.fc<-anti_join(chop.fc,remresults,by=c('col' = 'col', 'empi' = 'empi','task_name'='task_name'))
chop.fc<-chop.fc%>%ungroup%>%group_by(empi,col,task_name)%>%summarise(result=mean(result))

Next samples where no screening test or factor test was performed were removed.

#filter to just empi and col days that have a factor. first need day which factor is drawn
chop.fc<-chop.fc%>%mutate(colday=as.Date(substr(col,1,10),format='%Y-%m-%d'))

fac.day<-chop.fc%>%filter(!task_name %in% c('PTT','PT','Fibrinogen'))

chop.fc<-semi_join(chop.fc,fac.day,by=c('colday' = 'colday', 'empi' = 'empi'))
#Now filter to empi and collection days that have a screening test as well.
sc.day<-chop.fc%>%ungroup%>%filter(task_name %in% c('PTT','PT'))
chop.fc<-semi_join(chop.fc,sc.day,by=c('colday' = 'colday', 'empi' = 'empi'))

The final data set was created:

#Now spread for further analysis
table(chop.fc$task_name)
## 
##   FactorII   FactorIX    FactorV  FactorVII FactorVIII    FactorX 
##       1199       1924       1457       2374       8542       1205 
##   FactorXI  FactorXII Fibrinogen         PT        PTT 
##        982        658       2661       9619       9893
chopdf<-chop.fc%>%spread(task_name,result)%>%ungroup


summary(chopdf)
##       empi           col                colday              FactorII     
##  Min.   :    1   Length:10747       Min.   :2001-09-24   Min.   :  1.00  
##  1st Qu.: 1548   Class :character   1st Qu.:2006-01-08   1st Qu.: 59.00  
##  Median : 3776   Mode  :character   Median :2009-08-26   Median : 80.00  
##  Mean   : 4212                      Mean   :2009-07-30   Mean   : 75.54  
##  3rd Qu.: 6632                      3rd Qu.:2013-03-18   3rd Qu.: 95.00  
##  Max.   :10251                      Max.   :2016-11-12   Max.   :169.00  
##                                                          NA's   :9548    
##     FactorIX         FactorV         FactorVII         FactorVIII   
##  Min.   :  0.10   Min.   :  7.00   Min.   :   1.00   Min.   :  1.0  
##  1st Qu.: 31.00   1st Qu.: 60.00   1st Qu.:  38.00   1st Qu.: 78.0  
##  Median : 62.00   Median : 81.00   Median :  57.00   Median :101.0  
##  Mean   : 60.27   Mean   : 82.32   Mean   :  61.65   Mean   :117.4  
##  3rd Qu.: 79.00   3rd Qu.:100.00   3rd Qu.:  75.00   3rd Qu.:139.0  
##  Max.   :556.00   Max.   :316.00   Max.   :1435.00   Max.   :955.0  
##  NA's   :8823     NA's   :9290     NA's   :8373      NA's   :2205   
##     FactorX          FactorXI        FactorXII        Fibrinogen  
##  Min.   :  1.00   Min.   :  1.00   Min.   :  1.00   Min.   :  35  
##  1st Qu.: 59.00   1st Qu.: 56.25   1st Qu.: 38.00   1st Qu.: 228  
##  Median : 78.00   Median : 78.00   Median : 55.00   Median : 275  
##  Mean   : 75.56   Mean   : 76.08   Mean   : 62.86   Mean   : 298  
##  3rd Qu.: 93.00   3rd Qu.: 95.00   3rd Qu.: 84.00   3rd Qu.: 339  
##  Max.   :166.00   Max.   :224.00   Max.   :176.00   Max.   :1507  
##  NA's   :9542     NA's   :9765     NA's   :10089    NA's   :8086  
##        PT               PTT        
##  Min.   :  7.967   Min.   : 17.10  
##  1st Qu.: 12.600   1st Qu.: 28.90  
##  Median : 13.200   Median : 32.00  
##  Mean   : 13.950   Mean   : 35.58  
##  3rd Qu.: 14.100   3rd Qu.: 36.70  
##  Max.   :125.000   Max.   :250.00  
##  NA's   :1128      NA's   :854

Modeling

The following functions were used for the multivariate models. 5-fold cross validation was performed for each model.

library(modelr)
library(purrr)
library(broom)
library(tidyr)
library('randomForest')
library('dplyr')
#df=chopdf.i,k=5,pptt = 'PT'))i=1
modeling<-function(df,k,pptt){

#K-Fold Cross Validation
N = nrow(df)
K = k
ptptt.df<-df
ptptt.df$s = sample(1:K, size=N, replace=T)
pred_outputs.lm <- vector(mode="numeric", length=N)
pred_outputs.rf <- vector(mode="numeric", length=N)
obs_outputs <- vector(mode="numeric", length=N)
identifier<-vector(mode="numeric", length=N)
offset <- 0

for(i in 1:K){
  
  train <- ptptt.df%>%filter(s != i)
  test <-  ptptt.df%>%filter(s == i)

if("accession" %in% colnames(test)){
  identifier[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- test$accession
}else{
  identifier[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- test$empi
}
  
  if(pptt=='PTT'){
    obs_outputs[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- test$PTT
  #LM train/test
  lm <- lm(PTT ~ log(FactorVIII)+log(FactorXI)+log(FactorIX) +log(FactorXII)+log(FactorII)+log(FactorV)+log(FactorX)
           , data=train)
  lm.pred.curr <- predict(lm, test)
  pred_outputs.lm[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- lm.pred.curr

    
  #RF train/test
  rf <- randomForest(PTT ~ (FactorVIII)+(FactorIX)+(FactorXI)
                     +(FactorX)+FactorII+FactorV+FactorXII
                     , ntree=100, data=train , na.action = na.omit)
  rf.pred.curr <- predict(rf, newdata=test) 
  pred_outputs.rf[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- (rf.pred.curr)
  }
  if(pptt=='PT'){
    obs_outputs[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- test$PT
    #LM train/test
    lm <- lm(PT ~ log(FactorVII)+log(FactorX)+log(FactorII)+log(FactorV), data=train)
    lm.pred.curr <- predict(lm, test)
    pred_outputs.lm[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- lm.pred.curr

    #RF train/test
    rf <- randomForest(PT ~(FactorVII)+(FactorX)+(FactorII)+FactorV,na.action = na.omit, data=train, ntree=100)
    rf.pred.curr <- predict(rf, newdata=test) 
    pred_outputs.rf[1:length(ptptt.df$s[ptptt.df$s==i]) + offset] <- (rf.pred.curr)
    
    }
  offset <- offset + length(ptptt.df$s[ptptt.df$s==i])
}
predict<-data.frame(identifier,obs_outputs,pred_outputs.lm,pred_outputs.rf)

predict%>%
    mutate(residual.lm = pred_outputs.lm - obs_outputs,
    residual.rf = pred_outputs.rf - obs_outputs,
    better=ifelse(abs(residual.lm)>abs(residual.rf),'Random Forest','Linear Model'))

}


#The following will be used as a summation of the prediction and for R-squared values
resids<-function(df){  
df %>%na.omit()%>%
    summarise(
      sst = sum((obs_outputs - mean(obs_outputs)) ^ 2),
      sse.lm = sum(residual.lm^2),
      r.squared.lm = 1 -( sse.lm / sst),
      sse.rf = sum(residual.rf^2),
      r.squared.rf = 1 - (sse.rf / sst))
}

Results

The first goal of this project is to validate many of the assumptions and data found within published literature as well as clinical practice in the field of laboratory coagulation testing. These assumptions include (1) the range of results in a normal population, and (2) the dependence of the PT and PTT on the individual factor assays.

Since the PT and PTT tests are used as screening tests, the dataset contains many of these tests and may be used to investigate the validity of the high and low normal range limits used in clinical practice in this lab. In contrast, the individual factor assays are only used when there is a high pre-test probability of an abnormality (generally there is already an abnormal screening test).

tests<-c('PT','PTT')
f1<-list()
for (i in tests){
f1[[i]]<-penn%>%filter(task_name==i)%>%
  ggplot() +
  geom_density(aes(result), fill = 'gray') +
  geom_vline(xintercept = as.numeric(filter(normals, task_name == i)[2])) +
  geom_vline(xintercept = as.numeric(filter(normals, task_name == i)[3])) +
  xlab(paste0('Distribution of values for ',i)) +
  labs(caption = 'Vertical lines represent high and low end of normal range') +
  theme_bw()
}

f1[[2]]

The results indicate that the PTT normal range values match the distribution of results relatively well, however the upper limit of normal may need to be reevaluated.

f1[[1]]

Finally, there seems to be a discrepancy between the normal range for the PT test and the distribution in this data. The results may be significantly biased as the PT is frequently used for monitoring of warfarin therapy. Future work will need to investigate these trends subsetted on individuals not on biasing medications.

It is helpful when performing regression analysis to have a distribution of results. The Distributions of results were explored with histograms of screening lab results and specific coagulation factors (Figure 1).

library(gridExtra)
myhists <- list()  # new empty list
for (i in 1:length(colnames(penndf))) {
    p1 <- eval(substitute(
        ggplot(data=data.frame(penndf),aes(x=penndf[,i]))+ 
          geom_histogram(fill="grey") +
          geom_vline(xintercept = as.numeric(filter(normals,task_name==names(penndf)[i])[2]))+
          geom_vline(xintercept = as.numeric(filter(normals,task_name==names(penndf)[i])[3]))+
          xlab(paste0(names(penndf)[i]))+
          theme_bw()
    ,list(i = i)))
    #print(i)
    #print(p1)
    myhists[[i]] <- p1  # add each plot into plot list
    names(myhists)[[i]]<-names(penndf)[i]
}


grid.arrange(myhists[[3]], 
             myhists[[4]],
             myhists[[5]],
             myhists[[6]],
             myhists[[7]],
             myhists[[8]],
             myhists[[9]],
             myhists[[10]],
             myhists[[11]],
             myhists[[12]],
             myhists[[13]],
             myhists[[14]],
             nrow=3, ncol=4,top="Distribution of lab results in data set")

The data demonstrates that there is a relatively evenly distributed set of results for predictors lab values. For many of the factor assays there seems to be a bias towards the low end of the measurement range, likely indicating the heavily biased tested population, with a high pre-test probability of disease.

To validate that the relationship between factors and their screening tests were present in the data, boxplots were used to depict the relationship between normal and low factor levels (<25%) and the screening test.

library(gtools)

makebx.pt<-function(x,y){
    df<-na.omit(penndf[,c(x,y)])
    z<-as.matrix(df[,x])
    df[,x]<-ifelse(z>25,'Normal',"Low")

    ggplot(df,aes_string(x=x ,y=y))+
          geom_boxplot() +
          xlab(names(df)[1])
}
tests<-names(penndf)
bxpt<-lapply(tests,y='PT',makebx.pt)
names(bxpt)<-tests

grid.arrange(bxpt$FactorII, 
             bxpt$FactorV,
             bxpt$FactorX,
             bxpt$FactorVII,
             bxpt$FactorVIII,
             bxpt$FactorIX,
             bxpt$FactorXI,
             bxpt$FactorXII,
             nrow=3, ncol=4,top="Box plots depicting relationship of coagulation factors and Prothrombin Time")

These plots validate our biological understanding that the PT is associated with levels of factors 2,5,10 and factor 7, but no other factors in the intrinsic pathway.

Similar plots were created for factors against PTT.

tests<-names(penndf)
bxptt<-lapply(tests,y='PTT',makebx.pt)
names(bxptt)<-tests


grid.arrange(bxptt$FactorII, 
             bxptt$FactorV,
             bxptt$FactorX,
             bxptt$FactorVII,
             bxptt$FactorVIII,
             bxptt$FactorIX,
             bxptt$FactorXI,
             bxptt$FactorXII,
             nrow=3, ncol=4,top="Box plots depicting relationship of coagulation factor quantiles and partial thromboplastin time")

The results were somewhat surprising in that there appeared to be a relationship in these univariate plots between PTT and factor 7, and a lack of relationship with factor 9.

To further explore the relationships amongst these measures, factor levels were plotted against screening tests with dot plots.

makedot.pt<-function(x,y){
    df<-na.omit(penndf[,c(x,y)])
    df<-as.data.frame(df)
    ggplot(df,aes_string(x=x ,y=y))+
          geom_point(alpha=0.5,size=0.5) +
          geom_smooth(color='gray',method = 'lm')+
          theme_bw()+
          ylab('log(PT)')+
          xlab(paste0('log(',names(df)[1],')'))+
          scale_y_log10()+
          scale_x_log10()
      
    #print(p1)
    #bxpt[[x]] <- p1# add each plot into plot list
    #x<-x+1
    #gg
}
tests<-names(penndf)
dotpt<-lapply(tests,y='PT',makedot.pt)
names(dotpt)<-tests


grid.arrange(dotpt$FactorII, 
             dotpt$FactorV,
             dotpt$FactorX,
             dotpt$FactorVII,
             dotpt$FactorVIII,
             dotpt$FactorIX,
             dotpt$FactorXI,
             dotpt$FactorXII,
             nrow=3, ncol=3,top="Box plots depicting relationship of coagulation factor quantiles and partial thromboplastin time")#In the future suggest adding normal range vertical bars

As was indicated in the box plots a relationship exists in the data between PT and factor 7 and the common factors, but not with factors of the intrinsic pathway.

dotptt<-lapply(tests,y='PTT',makedot.pt)
names(dotptt)<-tests


grid.arrange(dotptt$FactorII, 
             dotptt$FactorV,
             dotptt$FactorX,
             dotptt$FactorVII,
             dotptt$FactorVIII,
             dotptt$FactorIX,
             dotptt$FactorXI,
             dotptt$FactorXII,
             nrow=3, ncol=3, top="Scatterplots of independent variables vs PTT")

#In the future suggest adding normal range vertical bars

The resulting plots against the PTT indicate relationshipss with PTT, the intrinsic and common factors, though quantitatively there appeared to be less of a relationship with factor 9.

#Considered plotting all the smooths together on one plot per screening test, not sure it makes sense.
# library(RColorBrewer)
# bl<-brewer.pal(7,'Blues')[4:7]
# gr<-brewer.pal(7,'Greens')[4:7]
# re<-brewer.pal(1,'Reds')
# cols<-c(bl,'red',gr)
# cols[7]<-'red'
# cols[5]<-"#41AB5D"
# 
# ggplot(penndf)+
#   geom_smooth(aes(x=FactorVII,y=PT,color='FactorVII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=Fibrinogen,y=PT,color='Fibrinogen'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorV,y=PT,color='FactorV'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorII,y=PT,color='FactorII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorX,y=PT,color='FactorX'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorVIII,y=PT,color='Factor VIII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorIX,y=PT,color='Factor IX'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorXI,y=PT,color='Factor XI'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorXII,y=PT,color='Factor XII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   scale_colour_manual(name="legend", values=cols)+
#   geom_hline(yintercept = as.numeric(normals$normal_low[10])) +
#   geom_hline(yintercept = as.numeric(normals$normal_high[10]))+
#   theme_bw()+
#   xlim(0,150)
# 
# 
# ggplot(penndf)+
#   geom_smooth(aes(x=FactorVII,y=PTT,color='FactorVII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=Fibrinogen,y=PTT,color='Fibrinogen'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorV,y=PTT,color='FactorV'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorII,y=PTT,color='FactorII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorX,y=PTT,color='FactorX'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorVIII,y=PTT,color='Factor VIII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorIX,y=PTT,color='Factor IX'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorXI,y=PTT,color='Factor XI'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   geom_smooth(aes(x=FactorXII,y=PTT,color='Factor XII'),method = "lm", formula = y ~ log(x),show.legend = TRUE,se=FALSE)+
#   scale_colour_manual(name="legend", values=cols)+                     
#   geom_hline(yintercept = as.numeric(normals$normal_low[10])) +
#   geom_hline(yintercept = as.numeric(normals$normal_high[10]))+
#   theme_bw()+
#   xlim(0,150)
# 
#   scale_colour_manual(name="legend",
#                       breaks=c('Fibrinogen','FactorII','FactorV','FactorX','FactorVIII','FactorIX','FactorXI','FactorXII','FactorVII'),
#                       labels=c('Fibrinogen','FactorII','FactorV','FactorX','FactorVIII','Factor IX','Factor XI','Factor XII','Factor VII')values=c("red",'#00FF33','#33CC33','#00CC00','#006600','#0000FF','#0000CC','#3399CC','#3366CC'))+
# ```

To quantitatively assess these relationships linear models were created with PT as the dependent variable.

attach(penndf)

#perofrm lm and store as list
pt.lm<-lapply( penndf[,-1], function(x) summary(lm(penndf$PT ~ x)) )
#Extract relavent parameters and store as dataframe lmresults.pt
variable<-names(pt.lm)
cols<-as.list(c(1:14))
beta.f<-function(x){
  pt.lm[x][[1]][[4]][2,1]
}

pvalue.f<-function(x){
  pt.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  pt.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmresults.pt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='linear'),stringsAsFactors = FALSE)
lmresults.pt[2:4]<-sapply(lmresults.pt[2:4],as.numeric)

penndf[penndf==0]<-1

#perofrm lm for pt on log of dependent variable and store as list
ptlg.lm<-lapply( penndf[,-1], function(x) summary(lm(penndf$PT ~ log(x))))
#Extract relavent parameters and store as dataframe lmresults.pt
variable<-names(ptlg.lm)
cols<-as.list(c(1:14))
beta.f<-function(x){
  ptlg.lm[x][[1]][[4]][2,1]
}

pvalue.f<-function(x){
  ptlg.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  ptlg.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmresultslg.pt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='log'),stringsAsFactors = FALSE)
lmresultslg.pt[2:4]<-sapply(lmresultslg.pt[2:4],as.numeric)
lmresults.pt<-bind_rows(lmresults.pt,lmresultslg.pt)%>%filter(!variable %in% c('accession','allna','PT','PTT','ReptilaseTest'))

knitrkable<-function(df){
df = data.frame(sapply(df, format, digits=3))
knitr::kable(arrange(df,variable))
}

knitrkable(lmresults.pt)
variable beta pvalue arsquare log
FactorII -0.11194 1.86e-10 0.198013 linear
FactorII -7.12183 4.09e-18 0.338863 log
FactorIX -0.02638 1.13e-02 0.019545 linear
FactorIX -0.32677 3.43e-01 -0.000361 log
FactorV -0.02991 5.38e-02 0.013771 linear
FactorV -4.41260 2.68e-07 0.121580 log
FactorVII -0.08596 5.29e-09 0.148039 linear
FactorVII -6.39703 2.59e-22 0.363801 log
FactorVIII 0.01259 3.44e-42 0.103968 linear
FactorVIII 0.47046 2.23e-11 0.025733 log
FactorX -0.15990 1.59e-13 0.243551 linear
FactorX -9.04819 1.49e-34 0.541386 log
FactorXI 0.01472 2.80e-01 0.000795 linear
FactorXI 0.65486 1.43e-01 0.005291 log
FactorXII -0.03170 1.56e-02 0.135421 linear
FactorXII -2.66268 2.45e-04 0.310799 log
Fibrinogen -0.00426 9.96e-168 0.023077 linear
Fibrinogen -2.51164 0.00e+00 0.063838 log

These were plotted for ease of analysis

library(ggrepel)
gglm<-function(df){
ggplot(filter(df,as.numeric(as.character(pvalue))<=0.5), aes(x=as.numeric(as.character(beta)),y=as.numeric(as.character(arsquare)),color=log,label=variable))+
  geom_point()+
  geom_text_repel(hjust = 0, nudge_x = 0.06,show.legend = FALSE)+
  theme_bw()+
    xlab('Beta')+
    ylab('R-Sqaured')
}
gglm(lmresults.pt)+ggtitle('Results of LM for PT results')

ggplot(lmresults.pt, aes(x=beta,y=arsquare,color=log,label=variable))+
  geom_point()+
  geom_text_repel(hjust = 0, nudge_x = 0.06,show.legend = FALSE)+
  theme_bw()

The results of the univariate regression support much of the information and conventional wisdom in the literature. The most significant predictor of the prothrombin time are common pathway and extrinsic pathway coagulation factors (Factors X,II,VII). Interestingly, the logarithmic model produces the most robust predictors. Unexpectedly, two common pathway factors had little impact on the Prothrombin time, Factor V and Fibrinogen. The literature suggests that biologically, very little Factor V activity is required for coagulation, which supports the results of this analysis.

attach(penndf)

#perofrm lm and store as list
ptt.lm<-lapply( penndf[,-1], function(x) summary(lm(penndf$PTT ~ x)) )
#Extract relavent parameters and store as dataframe lmresults.ptt
variable<-names(ptt.lm)
cols<-as.list(c(1:14))
beta.f<-function(x){
  ptt.lm[x][[1]][[4]][2,1]
}

pvalue.f<-function(x){
  ptt.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  ptt.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmresults.ptt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='linear'),stringsAsFactors = FALSE)
lmresults.ptt[2:4]<-sapply(lmresults.ptt[2:4],as.numeric)


#perofrm lm for ptt on log of dependent variable and store as list
pttlg.lm<-lapply( penndf[,-1], function(x) summary(lm(penndf$PTT ~ log(x))))
#Extract relavent parameters and store as dataframe lmresults.ptt
variable<-names(pttlg.lm)
cols<-as.list(c(1:14))
beta.f<-function(x){
  pttlg.lm[x][[1]][[4]][2,1]
}

pvalue.f<-function(x){
  pttlg.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  pttlg.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmresultslg.ptt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='log'),stringsAsFactors = FALSE)
lmresultslg.ptt[2:4]<-sapply(lmresultslg.ptt[2:4],as.numeric)
lmresults.ptt<-bind_rows(lmresults.ptt,lmresultslg.ptt)%>%filter(!variable %in% c('accession','allna','PT','PTT','ReptilaseTest'))
knitr::kable(arrange(lmresults.ptt,variable))
variable beta pvalue arsquare log
FactorII -0.1942124 0.0001603 0.0711688 linear
FactorII -10.8612340 0.0000126 0.0957737 log
FactorIX -0.1049640 0.0043785 0.0256164 linear
FactorIX -8.4691400 0.0000000 0.1713060 log
FactorV -0.1538656 0.0001721 0.0642651 linear
FactorV -11.0804822 0.0000008 0.1120646 log
FactorVII -0.0170598 0.7034895 -0.0041262 linear
FactorVII -3.9658146 0.0631169 0.0118341 log
FactorVIII -0.0362721 0.0000000 0.0547291 linear
FactorVIII -7.4088227 0.0000000 0.4138068 log
FactorX -0.1393857 0.0033211 0.0390460 linear
FactorX -7.5630835 0.0000248 0.0839177 log
FactorXI -0.1384195 0.0000361 0.0715314 linear
FactorXI -4.7460341 0.0000161 0.0781015 log
FactorXII -0.2620072 0.0300836 0.1054240 linear
FactorXII -18.4689706 0.0078019 0.1665895 log
Fibrinogen -0.0133376 0.0000000 0.0244148 linear
Fibrinogen -6.9000287 0.0000000 0.0520597 log
gglm(lmresults.ptt)+ggtitle('Results of LM for PTT results')

ggplot(lmresults.ptt, aes(x=beta,y=arsquare,color=log,label=variable))+
  geom_point()+
  geom_text_repel(hjust = 0, nudge_x = 0.06,show.legend = FALSE)+
  theme_bw()

Like the results of the linear regression models for the prothrombin time, the results for the partial thromboplastin time confirm many common assumptions about the relationships between these factors and provide some surprising results. By far the best predictors are factors VIII and XI, particularly the log transformed values. These are both coagulation factors in the intrinsic pathway that are known to have a strong impact on partial thromboplastin times. Amongst the factors that were found to be less associated with PTT were the common pathway factors, despite the fact that the common pathway is required for coagulation testing using the PTT. This result is most likely due to a combination of a relatively weak impact of these factors on PTT clotting times as well as the limitations in the number of datapoints in the dataset.

Imputation of Penn data

To perform multivariate analyses the data set was imputed with data the simulates results in the normal range for each missing factor assay. Imputations were performed separately for PT and PTT based models in order to account for different factor requirements in each. PTT based models included data points with at least 1/3 intrinsic factors. PT based models included at least 1/4 extrinsic or common pathway factors.

#Impute for PTT analysis
penndf$facna<-rowSums(apply(is.na(penndf[,c(4,7,9)]), 2, as.numeric))
penndf.ii<-penndf%>%filter(facna<3)%>%filter(!is.na(PTT))%>%data.frame
penndf.ii[is.na(penndf.ii)]<-rnorm(mean = 90, sd=10, n=length(penndf.ii[is.na(penndf.ii)]))

Models for Penn PTT

Results for PTT based linear models and random forest are depicted below

predictions.p.ptt<-modeling(df=penndf.ii,k=5,pptt='PTT')

rs.p.ptt<-resids(predictions.p.ptt)

knitr::kable(rs.p.ptt,col.names = c('SST','SSE for LM', 'r^2 for LM','SSE for RF','r^2 for RF'))
SST SSE for LM r^2 for LM SSE for RF r^2 for RF
555400.9 262992.5 0.5264818 185163.7 0.6666126

A plot of residuals was created as well.

library('ggExtra')
ggplot(data=predictions.p.ptt,aes(obs_outputs,residual.lm)) +
  geom_hline(yintercept = 0) +
  geom_point(aes(y=residual.lm),color='blue',alpha=0.3,size=0.4) +
  stat_smooth(method = "loess",aes(y=residual.lm),color='blue',fill='blue') +
  geom_point(aes(y=residual.rf),color='red',alpha=0.3,size=0.4) +
  stat_smooth(method = "loess",aes(y=residual.rf),color='red',fill='red') +
  #geom_segment(aes(x = obs_outputs, y = residual.lm, xend = obs_outputs, yend = residual.rf, colour = better),arrow=arrow(length = unit(0.01, "npc")),alpha=0.3)+ 
  scale_colour_manual(values = c("blue","red"),
                      guide = guide_legend(title = "Regression Models",
                                           keywidth = 3, 
                                          keyheight = 1,
                                          label.position = "left"))+
  geom_vline(xintercept = as.numeric(filter(normals, task_name == 'PTT')[2])) +
  geom_vline(xintercept = as.numeric(filter(normals, task_name == 'PTT')[3])) +
  theme_minimal(base_size = 20)+
  #theme(legend.position = c(0.75, .85))+
  labs(title = "Regression Models for PTT",
       subtitle= paste("The R-squared for the LM is",round(rs.p.ptt$r.squared.lm,2),"and the R-squared for the Randomforest is",round(rs.p.ptt$r.squared.rf,2)),
       x="Observed PTT (s)",
       y="Residuals")

The R squared for the linear model and random forest model were consistently 0.5 and close to 0.7 after imputation. The plot of residuals indicates that prediction was superior at lower PTT values, an important distinction to make as this is often portion of the measurement range where questions regarding the need for additional testing arises. Limited exploration of residual outliers was performed on five events in which the random forest model overestimated the PTT by ~50 s. This was due to a patient that clearly had outlier data values, in the form of abnormally high PTT values given his Factor IX value.

Similar models were created for the prediction of PT values after imputation.

penndf$facna<-rowSums(apply(is.na(penndf[,c(3,5,6,8)]), 2, as.numeric))
penndf.i<-penndf%>%filter(facna<4)%>%filter(!is.na(PT))%>%data.frame
penndf.i[is.na(penndf.i)]<-rnorm(mean = 90, sd=10, n=length(penndf.i[is.na(penndf.i)]))
pt.df<-penndf.i           

predictions.p.pt<- modeling(df=penndf.i,k=5,pptt = 'PT')
rs.p.pt<-resids(predictions.p.pt)
knitr::kable(rs.p.pt,col.names = c('SST','SSE for LM', 'r^2 for LM','SSE for RF','r^2 for RF'))
SST SSE for LM r^2 for LM SSE for RF r^2 for RF
24781.66 12247.06 0.5058017 8159.06 0.6707622

A plot of residuals was created as well.

ggplot(data=predictions.p.pt,aes(obs_outputs)) +
  geom_hline(yintercept = 0) +
  geom_point(aes(y=residual.lm),color='blue',alpha=0.3,size=0.5) +
  stat_smooth(method = "loess",aes(y=residual.lm),color='blue',fill='blue') +
  geom_point(aes(y=residual.rf),color='red',alpha=0.3,size=0.5) +
  stat_smooth(method = "loess",aes(y=residual.rf),color='red',fill='red') +
  #geom_segment(aes(x = obs_outputs, y = residual.lm, xend = obs_outputs, yend = residual.rf, colour = better),arrow=arrow(length = unit(0.01, "npc")),alpha=0.3)+ 
  scale_colour_manual(values = c("blue","red"),
                      guide = guide_legend(title = "Regression Models",
                                           keywidth = 3, 
                                          keyheight = 1,
                                          label.position = "left"))+
  geom_vline(xintercept = as.numeric(filter(normals, task_name == 'PTT')[2])) +
  geom_vline(xintercept = as.numeric(filter(normals, task_name == 'PTT')[3])) +
  theme_minimal(base_size = 20)+
  #theme(legend.position = c(0.75, .85))+
  labs(title = "Regression Models for PT",
       subtitle= paste("The R-squared for the LM is",round(rs.p.pt$r.squared.lm,2),"and the R-squared for the Randomforest is",round(rs.p.pt$r.squared.rf,2)),
       x="Observed PT (s)",
       y="Residuals")

The models performed adequately providing R-squared in the range of 0.48 for the linear model and 0.54 for the random forest.

Given the success of modeling PT and PTT values with Penn data, I was curious to see if similar results could be obtained with CHOP coagulation lab data. This data is somewhat different form the penn data in regards to the far more heterogenous patient population, differences in ordering patters, as well as analytical differences.

Histograms of results distributions were explored for the CHOP data.

library(gridExtra)
myhists <- list()  # new empty list
for (i in 4:length(colnames(chopdf))) {
    p1 <- eval(substitute(
        ggplot(data=data.frame(chopdf),aes(x=chopdf[,i]))+ 
          geom_histogram(fill="grey") +
          #geom_vline(xintercept = as.numeric(filter(normals,task_name==names(penndf)[i])[2]))+
          #geom_vline(xintercept = as.numeric(filter(normals,task_name==names(penndf)[i])[3]))+
          xlab(paste0(names(chopdf)[i]))+
          theme_bw()
    ,list(i = i)))
    #print(i)
    #print(p1)
    myhists[[i]] <- p1  # add each plot into plot list
    names(myhists)[[i]]<-names(chopdf)[i]
}


grid.arrange(myhists[[4]],
             myhists[[5]],
             myhists[[6]],
             myhists[[7]],
             myhists[[8]],
             myhists[[9]],
             myhists[[10]],
             myhists[[11]],
             myhists[[12]],
             myhists[[13]],
             nrow=3, ncol=4,top="Distribution of lab results in data set")

The data demonstrates that there is a relatively evenly distributed set of results for predictors lab values.

library(gtools)

makebx.pt<-function(x,y){
    df<-na.omit(chopdf[,c(x,y)])
    z<-as.matrix(df[,x])
    df[,x]<-ifelse(z>25,'Normal',"Low")
    df<-as.data.frame(df)
    ggplot(df,aes_string(x=x ,y=y))+
          geom_boxplot() +
          xlab(names(df)[1])+
          scale_y_log10()
    #print(p1)
    #bxpt[[x]] <- p1# add each plot into plot list
    #x<-x+1
    #gg
}
tests<-as.list(names(chopdf)[c(3:11)])
bxpt<-lapply(tests,y='PT',makebx.pt)
names(bxpt)<-tests


grid.arrange(bxpt$FactorII, 
             bxpt$FactorV,
             bxpt$FactorX,
             bxpt$FactorVII,
             bxpt$FactorVIII,
             bxpt$FactorIX,
             bxpt$FactorXI,
             bxpt$FactorXII,
              nrow=3, ncol=3,top="Box plots depicting relationship of coagulation factors and Prothrombin Time")

As with the Penn data, the results confirm that the known biological relationships are present within this clinical data as well.

bxptt<-lapply(tests,y='PTT',makebx.pt)
names(bxptt)<-tests


grid.arrange(bxptt$FactorII, 
             bxptt$FactorV,
             bxptt$FactorX,
             bxptt$FactorVII,
             bxptt$FactorVIII,
             bxptt$FactorIX,
             bxptt$FactorXI,
             bxptt$FactorXII,
             nrow=3, ncol=3,top="Box plots depicting relationship of coagulation factor quantiles and partial thromboplastin time")

As with the PT data, biological and clinical assocations were confirmed with extrinsic and common pathway factors appearing to be related to PTT. Again, a relationship appears to exist between PTT and factor 7 for unclear reasons.

To further explore the relationships amongst these measures, bivariate dot plots were created.

makedot.pt<-function(x,y){
    df<-na.omit(chopdf[,c(x,y)])
    df<-as.data.frame(df)
    ggplot(df,aes_string(x=x ,y=y))+
          geom_point(alpha=0.5,size=0.5) +
          geom_smooth(color='gray',method = 'lm')+
          theme_bw()+
          xlab(names(df)[1])+
          scale_y_log10()+
          scale_x_log10()
    #print(p1)
    #bxpt[[x]] <- p1# add each plot into plot list
    #x<-x+1
    #gg
    }
dotpt<-lapply(tests,y='PT',makedot.pt)
names(dotpt)<-tests


grid.arrange(dotpt$FactorII, 
             dotpt$FactorV,
             dotpt$FactorX,
             dotpt$FactorVII,
             dotpt$FactorVIII,
             dotpt$FactorIX,
             dotpt$FactorXI,
             dotpt$FactorXII,
             nrow=3, ncol=3,top="Scatter plots depicting relationship of coagulation factors and PT")#In the future suggest adding normal range vertical bars

dotptt<-lapply(tests,y='PTT',makedot.pt)
names(dotptt)<-tests


grid.arrange(dotptt$FactorII, 
             dotptt$FactorV,
             dotptt$FactorX,
             dotptt$FactorVII,
             dotptt$FactorVIII,
             dotptt$FactorIX,
             dotptt$FactorXI,
             dotptt$FactorXII,
             nrow=3, ncol=3,top="Scatterplots of independent variables vs PTT")

These results confirm the impression created by the box plots that the data contains the relationship between screening assays and their dependent factors.

Univariate correlations for CHOP data and Screening tests

These were followed up with linear models to quantify these relationships.

attach(chopdf)
chopdf.lm<-chopdf%>%ungroup%>%select(4:14)
#perofrm lm and store as list
ptt.chop.lm<-lapply( chopdf.lm[,-2], function(x) summary(lm(chopdf.lm$PTT ~ x)) )
#Extract relavent parameters and store as dataframe lmresults.ptt
variable<-names(ptt.chop.lm)
cols<-as.list(c(1:11))
beta.f<-function(x){
  ptt.chop.lm[x][[1]][[4]][2,1]
}
pvalue.f<-function(x){
  ptt.chop.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  ptt.chop.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmres.chop.ptt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='linear'),stringsAsFactors = FALSE)
lmres.chop.ptt[2:4]<-sapply(lmres.chop.ptt[2:4],as.numeric)

columns<-names(chopdf.lm)[1:11]
#perofrm lm for ptt on log of dependent variable and store as list
pttlg.chop.lm<-lapply( chopdf.lm[,-2], function(x) summary(lm(chopdf.lm$PTT ~ log(x))))
#Extract relavent parameters and store as dataframe lmres.chop.ptt
variable<-names(pttlg.chop.lm)
cols<-as.list(c(1:10))
beta.f<-function(x){
  pttlg.chop.lm[x][[1]][[4]][2,1]
}

pvalue.f<-function(x){
  pttlg.chop.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  pttlg.chop.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmres.chop.lg.ptt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='log'),stringsAsFactors = FALSE)
lmres.chop.lg.ptt[2:4]<-sapply(lmres.chop.lg.ptt[2:4],as.numeric)
lmres.chop.ptt<-bind_rows(lmres.chop.ptt,lmres.chop.lg.ptt)%>%filter(!variable %in% c('accession','allna','PT','PTT'))

knitrkable(lmres.chop.ptt)
variable beta pvalue arsquare log
FactorII -0.2826 3.13e-35 0.13409 linear
FactorII -15.3512 1.27e-45 0.17217 log
FactorV -0.1840 5.45e-20 0.06239 linear
FactorV -17.8979 1.91e-35 0.11248 log
FactorVII -0.0185 1.78e-03 0.00438 linear
FactorVII -9.1679 8.79e-66 0.13626 log
FactorVIII -0.0264 8.62e-25 0.01304 linear
FactorVIII -7.2305 2.37e-211 0.11359 log
FactorX -0.1571 1.87e-13 0.04905 linear
FactorX -8.5570 2.34e-17 0.06485 log
FactorXI -0.3168 1.07e-39 0.17070 linear
FactorXI -18.5503 2.59e-138 0.49213 log
FactorXII -0.1058 2.22e-05 0.02602 linear
FactorXII -12.4389 5.19e-23 0.13916 log
Fibrinogen -0.0156 9.70e-07 0.00931 linear
Fibrinogen -10.6318 1.19e-27 0.04678 log
gglm(lmres.chop.ptt)+ggtitle('Results of Linear Models for PTT results')

These results confirm the association of intrinsic and common factors with the PTT, particularly when log transformed. Oddly, FactorVII appeared to be associated as well, which would not be biologically predicted. This may be due to covariates not accounted for in these univariate models.

Calculate Univariate correlations for CHOP data PT

attach(chopdf.lm)
#perofrm lm and store as list
pt.chop.lm<-lapply( chopdf.lm[,-1], function(x) summary(lm(chopdf.lm$PT ~ x)) )
#Extract relavent parameters and store as dataframe lmresults.pt
variable<-names(pt.chop.lm)
cols<-as.list(c(1:10))
beta.f<-function(x){
  pt.chop.lm[x][[1]][[4]][2,1]
}
pvalue.f<-function(x){
  pt.chop.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  pt.chop.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmres.chop.pt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='linear'),stringsAsFactors = FALSE)
lmres.chop.pt[2:4]<-sapply(lmres.chop.pt[2:4],as.numeric)

columns<-names(chopdf.lm)[1:11]
#perofrm lm for pt on log of dependent variable and store as list
ptlg.chop.lm<-lapply( chopdf.lm[,-1], function(x) summary(lm(chopdf.lm$PT ~ log(x))))
#Extract relavent parameters and store as dataframe lmres.chop.pt
variable<-names(ptlg.chop.lm)
cols<-as.list(c(1:10))
beta.f<-function(x){
  ptlg.chop.lm[x][[1]][[4]][2,1]
}

pvalue.f<-function(x){
  ptlg.chop.lm[x][[1]][[4]][2,4]
}

arsquared.f<-function(x){
  ptlg.chop.lm[x][[1]][[9]]
}

beta<-lapply(cols, beta.f)
pvalue<-lapply(cols, pvalue.f)
arsquare<-lapply(cols, arsquared.f)

lmres.chop.lg.pt<-as.data.frame(cbind(variable=unlist(variable),beta=unlist(beta),pvalue=unlist(pvalue),arsquare=unlist(arsquare),log='log'),stringsAsFactors = FALSE)
lmres.chop.lg.pt[2:4]<-sapply(lmres.chop.lg.pt[2:4],as.numeric)
lmres.chop.pt<-bind_rows(lmres.chop.pt,lmres.chop.lg.pt)%>%filter(!variable %in% c('accession','allna','PT','PTT'))

knitrkable(lmres.chop.pt)
variable beta pvalue arsquare log
FactorIX -0.020518 9.85e-09 0.02103 linear
FactorIX -0.907313 3.00e-12 0.03131 log
FactorV -0.033603 1.15e-09 0.02532 linear
FactorV -4.654673 4.63e-32 0.09346 log
FactorVII -0.020058 2.78e-24 0.04307 linear
FactorVII -5.523149 3.93e-288 0.43188 log
FactorVIII 0.011595 4.11e-80 0.04486 linear
FactorVIII 0.896238 2.55e-45 0.02512 log
FactorX -0.101022 3.06e-43 0.14963 linear
FactorX -7.549011 6.01e-121 0.37372 log
FactorXI 0.000684 8.23e-01 -0.00112 linear
FactorXI -0.168798 1.10e-01 0.00183 log
FactorXII -0.000738 8.03e-01 -0.00170 linear
FactorXII -0.023048 8.79e-01 -0.00178 log
Fibrinogen -0.002804 1.60e-03 0.00344 linear
Fibrinogen -2.474258 1.08e-19 0.03082 log
#Funciton for plotting the univariate lm results

gglm(lmres.chop.pt)+ggtitle('Results of Linear Models for PT results')

The results are somewhat surprising in that there appears in the data to be a correlation of almost factors with PTT, after log transformation, including factor VII. On the other hand far fewer factors are associated with PT values, though the ones that are correlated include those that would be assumed based on biology.

Creating multivariate models for CHOP

Before constructing multivariate models for the CHOP data imputation of normal values was performed on a dataset limited to include at least one intrinsic or one common/extrinsic factors for the PTT and PT models respectively.

#Imputation for PT models
chopdf$facna<-rowSums(apply(is.na(chopdf[,c(4,6,7,9)]), 2, as.numeric))
chopdf.i<-chopdf%>%filter(facna<4)%>%filter(!is.na(PT))%>%data.frame
chopdf.i[is.na(chopdf.i)]<-rnorm(mean = 90, sd=10, n=length(chopdf.i[is.na(chopdf.i)]))

Train and cross validate multivariate models for PT CHOP

predictions.c.pt<-modeling(chopdf.i,k=5,pptt='PT')
rs.c.pt<-resids(predictions.c.pt)
knitr::kable(rs.c.pt,col.names = c('SST','SSE for LM', 'r^2 for LM','SSE for RF','r^2 for RF'))
SST SSE for LM r^2 for LM SSE for RF r^2 for RF
104481.8 57553.23 0.4491557 40655.79 0.6108817

A plot of residuals was created as well.

ggplot(data=predictions.c.pt,aes(obs_outputs)) +
  geom_hline(yintercept = 0) +
  geom_point(aes(y=residual.lm),color='blue',alpha=0.3,size=0.4) +
  stat_smooth(method = "loess",aes(y=residual.lm),color='blue',fill='blue') +
  geom_point(aes(y=residual.rf),color='red',alpha=0.3,size=0.4) +
  stat_smooth(method = "loess",aes(y=residual.rf),color='red',fill='red') +
  #geom_segment(aes(x = obs_outputs, y = residual.lm, xend = obs_outputs, yend = residual.rf, colour = better),arrow=arrow(length = unit(0.01, "npc")),alpha=0.3)+ 
  scale_colour_manual(values = c("blue","red"),
                      guide = guide_legend(title = "Regression Models",
                                           keywidth = 3, 
                                          keyheight = 1,
                                          label.position = "left"))+
  theme_minimal(base_size = 20)+
  #theme(legend.position = c(0.75, .85))+
  labs(title = "Regression Models for PT",
       subtitle= paste("The R-squared for the LM is",round(rs.c.pt$r.squared.lm,2),"and the R-squared for the Randomforest is",round(rs.c.pt$r.squared.rf,2)),
       x="Observed PT (s)",
       y="Residuals")

The models performed well providing R-squared in the range of 0.47 for the linear model and 0.65 for the random forest.

Impute Chop data for PTT

chopdf$facna<-rowSums(apply(is.na(chopdf[,c(5,8,10)]), 2, as.numeric))
chopdf.ii<-chopdf%>%filter(facna<2)%>%filter(!is.na(PT))%>%data.frame
chopdf.ii[is.na(chopdf.ii)]<-rnorm(mean = 90, sd=10, n=length(chopdf.ii[is.na(chopdf.ii)]))

Build PTT Models CHOP

predictions.c.ptt<-modeling(chopdf.ii,k=5,pptt='PTT')
rs.c.ptt<-resids(predictions.c.ptt)
knitr::kable(rs.c.ptt,col.names = c('SST','SSE for LM', 'r^2 for LM','SSE for RF','r^2 for RF'))
SST SSE for LM r^2 for LM SSE for RF r^2 for RF
927287.1 600622.5 0.3522798 530980.4 0.4273829

A plot of residuals was created for the CHOP PTT models,as well.

ggplot(data=predictions.c.ptt,aes(obs_outputs)) +
  geom_hline(yintercept = 0) +
  geom_point(aes(y=residual.lm),color='blue',alpha=0.3,size=0.4) +
  stat_smooth(method = "loess",aes(y=residual.lm),color='blue',fill='blue') +
  geom_point(aes(y=residual.rf),color='red',alpha=0.3,size=0.4) +
  stat_smooth(method = "loess",aes(y=residual.rf),color='red',fill='red') +
  #geom_segment(aes(x = obs_outputs, y = residual.lm, xend = obs_outputs, yend = residual.rf, colour = better),arrow=arrow(length = unit(0.01, "npc")),alpha=0.3)+ 
  scale_colour_manual(values = c("blue","red"),
                      guide = guide_legend(title = "Regression Models",
                                           keywidth = 3, 
                                          keyheight = 1,
                                          label.position = "left"))+
  theme_minimal(base_size = 20)+
  #theme(legend.position = c(0.75, .85))+
  labs(title = "Regression Models for PTT",
       subtitle= paste("The R-squared for the LM is",round(rs.c.ptt$r.squared.lm,2),"and the R-squared for the Randomforest is",round(rs.c.ptt$r.squared.rf,2)),
       x="Observed PTT (s)",
       y="Residuals")

The data demonstrate fair performance when modeling PTT values with a LM model R^2 of 0.38 and RF model R^2 of 0.47. This is in contrast to the models of PTT results created with Penn data which had superior performance, up to an R-squared of 0.69.

Performance of trained models against test data from opposing hospital

Finally the performance of models created with data from each institution was tested against the data from the opposing institution. This was performed to investigate how flexible these models were to vastly different clinical arenas, to include the significant differences between pediatric and adult medicine.

#Build 8 complete models for testing against oppositin institution (2 screening tests X 2 models X 2 Hospitals)
      lm.p.ptt <- lm(PTT ~ log(FactorVIII)+log(FactorXI)+log(FactorIX) +log(FactorXII)+log(FactorII)+log(FactorV)+log(FactorX)
               , data=penndf.ii)
      rf.p.ptt <- randomForest(PTT ~ (FactorVIII)+(FactorIX)+(FactorXI)
                         +(FactorX)+FactorII+FactorV+FactorXII
                         , ntree=100, data=penndf.ii , na.action = na.omit)

      lm.p.pt <- lm(PT ~ log(FactorVII)+log(FactorX)+log(FactorII)+log(FactorV), data=penndf.i)
      rf.p.pt <- randomForest(PT ~(FactorVII)+(FactorX)+(FactorII)+FactorV,na.action = na.omit, data=penndf.i, ntree=100)

      lm.c.ptt <- lm(PTT ~ log(FactorVIII)+log(FactorXI)+log(FactorIX) +log(FactorXII)+log(FactorII)+log(FactorV)+log(FactorX)
               , data=chopdf.ii)
      rf.c.ptt <- randomForest(PTT ~ (FactorVIII)+(FactorIX)+(FactorXI)
                         +(FactorX)+FactorII+FactorV+FactorXII
                         , ntree=100, data=chopdf.ii , na.action = na.omit)

      lm.c.pt <- lm(PT ~ log(FactorVII)+log(FactorX)+log(FactorII)+log(FactorV), data=chopdf.i)
      rf.c.pt <- randomForest(PT ~(FactorVII)+(FactorX)+(FactorII)+FactorV,na.action = na.omit, data=chopdf.i, ntree=100)

#Predict results with each of the 8 models on the 4 datasets (different imputations methodologies when modeling for PTT and PT)      
       chop.ptt.lm.pred<- predict(lm.p.ptt, chopdf.ii)
       chop.ptt.rf.pred<- predict(rf.p.ptt, chopdf.ii)
       chop.pt.lm.pred<- predict(lm.p.pt, chopdf.i)
       chop.pt.rf.pred<- predict(rf.p.pt, chopdf.i)
       penn.ptt.lm.pred<- predict(lm.c.ptt, penndf.ii)
       penn.ptt.rf.pred<- predict(rf.c.ptt, penndf.ii)
       penn.pt.lm.pred<- predict(lm.c.pt, penndf.i)
       penn.pt.rf.pred<- predict(rf.c.pt, penndf.i)

chop.ptt.pred<-data.frame(pred_outputs.lm=chop.ptt.lm.pred,pred_outputs.rf=chop.ptt.rf.pred,obs_outputs=chopdf.ii$PTT)      
chop.pt.pred<-data.frame(pred_outputs.lm=chop.pt.lm.pred,pred_outputs.rf=chop.pt.rf.pred,obs_outputs=chopdf.i$PT)      
penn.ptt.pred<-data.frame(pred_outputs.lm=penn.ptt.lm.pred,pred_outputs.rf=penn.ptt.rf.pred,obs_outputs=penndf.ii$PTT)      
penn.pt.pred<-data.frame(pred_outputs.lm=penn.pt.lm.pred,pred_outputs.rf=penn.pt.rf.pred,obs_outputs=penndf.i$PT)          
 
 
calcresids<-function(df){
df%>%
    mutate(residual.lm = pred_outputs.lm - obs_outputs,
    residual.rf = pred_outputs.rf - obs_outputs,
    better=ifelse(abs(residual.lm)>abs(residual.rf),'Random Forest','Linear Model'))
}

#Create residuals tables
chop.ptt.pred<-calcresids(chop.ptt.pred)
chop.pt.pred<-calcresids(chop.pt.pred)
penn.ptt.pred<-calcresids(penn.ptt.pred)
penn.pt.pred<-calcresids(penn.pt.pred)

#Calculate R-squared
chop.ptt.rs<-resids(chop.ptt.pred)
chop.pt.rs<-resids(chop.pt.pred)
penn.ptt.rs<-resids(penn.ptt.pred)
penn.pt.rs<-resids(penn.pt.pred)


#Ready data for table
chop.ptt.rs<-chop.ptt.rs%>%transmute(Train='PENN',
                                 Test='CHOP',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
chop.pt.rs<-chop.pt.rs%>%transmute(Train='PENN',
                                 Test='CHOP',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
penn.ptt.rs<-penn.ptt.rs%>%transmute(Train='CHOP',
                                 Test='PENN',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
penn.pt.rs<-penn.pt.rs%>%transmute(Train='CHOP',
                                 Test='PENN',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
rss.c.ptt<-rs.c.ptt%>%transmute(Train='CHOP',
                                 Test='CHOP',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
rss.c.pt<-rs.c.pt%>%transmute(Train='CHOP',
                                 Test='CHOP',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
rss.p.ptt<-rs.p.ptt%>%transmute(Train='PENN',
                                 Test='PENN',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)
rss.p.pt<-rs.p.pt%>%transmute(Train='PENN',
                                 Test='PENN',
                                 'R-Squared LM'=r.squared.lm,
                                 'R-Squared RF'=r.squared.rf)

ptt.models<-bind_rows(chop.ptt.rs,penn.ptt.rs,rss.c.ptt,rss.p.ptt)%>%arrange(Test)
pt.models<-bind_rows(chop.pt.rs,penn.pt.rs,rss.c.pt,rss.p.pt)%>%arrange(Test)

The following data demonstrates some interesting patterns in the performance of models for predicting PTT results. They demonstrates that overall the models performed better when being cross-validated against the same source data. However the RF model suffered more degradation in performance when tested against data from another hospital, whereas the LM model weathered this change better. Therefore, when predicting PTT results from a new data source the LM model actually outperformed the RF model.

#Build table with shading for PTT
library(condformat)
condformat(ptt.models) +
rule_fill_gradient(3, expression= ptt.models[,3]*100,low = rgb(1,1,1), high = rgb(1,0,0)) +
rule_fill_gradient(4, expression= ptt.models[,4]*100,low = rgb(1,1,1), high = rgb(1,0,0))
Train Test R-Squared LM R-Squared RF
1 PENN CHOP 0.3286476 0.2516059
2 CHOP CHOP 0.3522798 0.4273829
3 CHOP PENN 0.4563122 0.4304846
4 PENN PENN 0.5264818 0.6666126

By contrast, the following table summarizes the performance of the models for predicting PT results. The table demonstrates that for PT results, no matter which training set or test set was used, the RF model was superior. Surprisingly, and in contrast to the pattern see for PTT models, the models trained on CHOP data outperformed the PENN model even when the tested data came from PENN. The table shows a robust ability of the CHOP model to predict PT results across hospital with wildly different patient populations. By far the worst performance was seen when the PENN models were applied to CHOP data. As with the PTT models, both the LM and RF approaches did not perform well when applied to the CHOP dataset.

#Build table with shading for PT
condformat(pt.models) +
rule_fill_gradient(3, expression= pt.models[,3]*100,low = rgb(1,1,1), high = rgb(1,0,0)) +
rule_fill_gradient(4, expression= pt.models[,4]*100,low = rgb(1,1,1), high = rgb(1,0,0))
Train Test R-Squared LM R-Squared RF
1 PENN CHOP 0.2880126 0.5089754
2 CHOP CHOP 0.4491557 0.6108817
3 CHOP PENN 0.3937582 0.6273221
4 PENN PENN 0.5058017 0.6707622

Examining Variable importance in final modes

The importance of the factors that contributed to the final models were examined for each model

beta.PTT<-data.frame('PENN LM PTT'=summary(lm.p.ptt)[[4]][2:8,1],'CHOP LM PTT'=summary(lm.c.ptt)[[4]][2:8,1])
gini.PTT<-data.frame(importance(rf.p.ptt),importance(rf.c.ptt))
colnames(gini.PTT)<-c('PENN RF PTT','CHOP RF PTT')
beta.PTT
##                  PENN.LM.PTT CHOP.LM.PTT
## log(FactorVIII)  -7.80579707   -5.112274
## log(FactorXI)    -5.96621808   -9.590423
## log(FactorIX)   -10.33868843  -13.053927
## log(FactorXII)  -12.62160145   -4.217527
## log(FactorII)    -6.63711657   -3.249932
## log(FactorV)     -6.74314464   -6.829557
## log(FactorX)     -0.05857498    1.321596

Examination of Beta coefficients for models for regression of PTT indicated that both PENN and CHOP classiffiers utilized intrinsic pathway factors to a large degree. A siginificant difference was the weight placed on Factor VIII in the penn data and the relatively greater weight placed on Factor XI in the CHOP data. Interestingly both favored factor V in the models, the reasoning is unclear but perhaps factor V may be interpreted as relatively frequently affecting the PTT in the patients examined

gini.PTT
##            PENN RF PTT CHOP RF PTT
## FactorVIII   248256.60   108760.34
## FactorIX     108702.96   240322.01
## FactorXI      48720.52   189875.50
## FactorX       24608.51    61857.79
## FactorII      33653.52    72563.79
## FactorV       36289.23    80770.33
## FactorXII     26570.54   102647.70

Amongst the random forest regression models for PTT, the CHOP model mirrored the CHOP LM to a large defree favoring extrinsic pathway, with somewhat more emphasis on Factor XII and perhaps less on Factor V. In contrast the PENN model appeared to favor just Factor VIII and Factor IX, without very significant use of other variables.

beta.PT<-data.frame('PENN LM PT'=summary(lm.p.pt)[[4]][2:5,1],'CHOP LM PT'=summary(lm.c.pt)[[4]][2:5,1])
gini.PT<-data.frame(importance(rf.p.pt),importance(rf.c.pt))
colnames(gini.PT)<-c('PENN RF PT','CHOP RF PT')
beta.PT
##                PENN.LM.PT CHOP.LM.PT
## log(FactorVII)  -3.110773  -3.735410
## log(FactorX)    -9.394748  -3.400584
## log(FactorII)    3.788991  -2.895850
## log(FactorV)    -4.109658  -1.753221

Amongst the models for PT values, the CHOP model incoroprated largely all independent variableswith Factor V the least robust. Interestingly the PENN model was strikingly different favoring Factor X and finding an association of higher FactorII results with higher PT’s, an association that does not seem intuitive.

gini.PT
##           PENN RF PT CHOP RF PT
## FactorVII   3752.409   33228.45
## FactorX    11645.099   23084.92
## FactorII    3547.185   19713.92
## FactorV     3219.276   19525.78

The random forest models tended to mirror the LM models with importance of variables in the CHOP model evenly spread whereas in the PENN model FactorX again appeared to be driving prediction.

Conclusion

The presented results validate several assumptions about the behavior and relationships amongst these tests. * The dependence and contribution of common factors to both the PTT and PT were verified as well as the contribution of intrinsic and extrinsic pathways, respectively. * The logarithmic relationship between individual assays was seen as well confirming what has long been appreciated in the laboratory.

Finally the data support the possibility of useing regression models for the prediction of PTT and PT.Interestingly, for the PT, the CHOP trained RF model was superior to the PENN model even when predicting results from PENN. This may be due to the greater number of datapoints available from CHOP, or perhaps the fact that the CHOP data draws from a longer time period makes it more extensible, though this alone would not explain why the PENN model did not work as well on PENN’s data.

Future work will look to increase the datapoints from PENN, and incorporate additional clinical data, including concomitant use of anticoagulants, and other clinical covariates. I hope to evaluate new models to predict whether the PT and PTT are normal or abnormal. This will provide clinical benefit, similar to this analysis, in the form of predicting whether certain factor values are sufficient to explain an abnormal screening assay. Furthermore the Reptilase test and fibrinogen assays were not examined here. The reptilase test is a test that is thought to be solely dependent on fibrinogen levels, however an interesting clinical question relates to how low a fibrinogen result has to be to cause the reptilase test to be abnormal.